Some Two-Step Procedures for Variable Selection in 
High-Dimensional Linear Regression 
ABSTRACT

We study the problem of high-dimensional variable selection via some two-step procedures.
First we show that given some good initial estimator which is $\ell_\infty$-consistent
but not necessarily variable selection consistent, we can apply the nonnegative Garrote,
adaptive Lasso or hard-thresholding procedure to obtain a final estimator that
is both estimation and variable selection consistent. Unlike the Lasso, our results do
not require the irrepresentable condition, which could fail easily even for moderate $p_n$
(Zhao and Yu, 2007), and they also allow $p_n$ to grow almost as fast as $\exp(n)$ (for
hard-thresholding there is no restriction on $p_n$). We also study the conditions under which
the Ridge regression can be used as an initial estimator. We show that under a relaxed
identifiable condition, the Ridge estimator is $\ell_\infty$-consistent. Such a condition is usually
satisfied when $p_n < n$ and does not require the partial orthogonality between relevant
and irrelevant covariates which is needed for the univariate regression in (Huang et al.,
2008). Our numerical studies show that when using the Lasso or Ridge as initial estimator,
the two-step procedures have a higher sparsity recovery rate than the Lasso or
adaptive Lasso with univariate regression used in (Huang et al., 2008).



Keywords: variable selection, nonnegative Garrote, adaptive Lasso, hard-thresholding, 
variable selection consistency, oracle properties 



I. Introduction 



Consider the linear regression model

$$Y = X\beta^* + \epsilon \tag{1.1}$$

where $X \in \mathbb{R}^{n \times p}$ is the design matrix, $Y \in \mathbb{R}^{n \times 1}$ is the
response vector, $\beta^* \in \mathbb{R}^{p \times 1}$ is the unknown parameter, and the errors
$\epsilon = [\epsilon_1, \ldots, \epsilon_n]^T$ are iid normal, i.e. $\epsilon \sim N(0, \sigma^2 I)$. We are
interested in regression with a diverging number of parameters, and will use $p_n$ to denote the
number of variables, which can grow as $n \to \infty$.

The key assumption for such high-dimensional estimation problems to be feasible is that
the true parameter $\beta^*$ is sparse. Let $S$ be the subset of indices such that
$S = \{j : \beta^*_j \neq 0\}$ and denote $s_n = |S|$, the cardinality of the set $S$. The sparsity
assumption means that the number of relevant variables $s_n$ is much smaller than $p_n$, i.e.
$s_n \ll p_n$. Under such a condition, efficient estimation and variable selection become possible.
For example, the Lasso (Tibshirani, 1996), which minimizes least squares with the $\ell_1$ penalty

$$\hat\beta^{Lasso} = \arg\min_{\beta} \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n \sum_{j=1}^{p_n} |\beta_j|, \tag{1.2}$$



has been proposed for such problems. Due to the $\ell_1$ penalty, the solution of the Lasso is
usually sparse with an appropriately chosen penalty parameter $\lambda_n$. Such a property has
made the Lasso a very desirable candidate for variable selection. Computationally, the Lasso
estimate is the solution of a convex optimization problem and can be computed efficiently.
Furthermore, it has been shown that the full solution path of the Lasso can be found at the same
cost as solving the least squares estimation problem (Osborne et al., 2000; Efron et al., 2004).
Various theoretical properties of the Lasso have also been studied (Fu and Knight, 2000;
Greenshtein and Ritov, 2004; Meinshausen and Bühlmann, 2006; Zou, 2006; Zhao and Yu, 2007;
Yuan and Lin, 2007; Bickel et al., 2007; Wainwright, 2006). One interesting property
found by several authors (Meinshausen and Bühlmann, 2006; Zou, 2006; Zhao and Yu, 2007)
is that the Lasso is not variable selection consistent in general, and a condition on the design
matrix (called the irrepresentable condition in (Zhao and Yu, 2007)) is needed to ensure
its variable selection consistency. For high-dimensional inference with increasing $p_n$, several
studies (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Wainwright, 2006) showed
that under the irrepresentable condition, the Lasso is also variable selection consistent if
additional conditions on $p_n$, $s_n$, $n$ and $\lambda_n$ are satisfied. In particular, it has been
shown that $p_n$ can be allowed to grow almost as fast as $\exp(n)$ when the error is normally
distributed. Although such theoretical results are very encouraging for the Lasso in
high-dimensional problems, it has been pointed out in (Zhao and Yu, 2007) that the key
irrepresentable condition on the design matrix can easily fail even for moderate $p_n$.



On the other hand, it is shown in (Fan and Li, 2001; Zou, 2006) that even if the
irrepresentable condition is satisfied and the Lasso is variable selection consistent, there does
not exist a tuning parameter which can lead to both efficient estimation and consistent
variable selection. It is argued that the desired estimator should possess the oracle
properties (Fan and Li, 2001), i.e. it should be variable selection consistent and the estimation of
the nonzero parameters should be efficient. As a result, the SCAD method has been
proposed and studied for both the fixed and the increasing $p_n$ setting with $p_n^5/n \to 0$
(Fan and Li, 2001; Fan and Peng, 2004), and it has been shown to have the oracle properties.
Huang et al. (2008) showed that the bridge estimator (Frank and Friedman, 1993) for the linear
model, which has a penalty term $\lambda_n \sum_{j=1}^{p_n} |\beta_j|^\gamma$ for $0 < \gamma < 1$, also has
the oracle properties under certain conditions when $p_n < n$. However, since the penalty functions
of both the SCAD and the bridge estimator are non-convex, such optimization problems are more
difficult to solve, and in general there is no guarantee of finding the global minimizer
efficiently, especially when the number of variables is large.

Recently several two-step procedures have been studied for variable selection. The adaptive
Lasso approach, which was recently proposed by Zou (2006), uses a weighted $\ell_1$ penalty
with weights determined by an initial estimator. In other words, the adaptive Lasso can
be thought of as a two-step procedure that applies the Lasso to a design transformed by
the initial estimator. For fixed $p_n$, Zou (2006) showed that if the initial estimator satisfies
certain conditions related to estimation consistency, the adaptive Lasso estimator has the
oracle properties. Huang et al. (2006) further extended the results of the adaptive Lasso
to increasing $p_n$. Yuan and Lin (2007) studied the nonnegative Garrote method (Breiman,
1995) for fixed $p_n$ and proved that when supplied with some good initial estimator which is
$\ell_\infty$-consistent, the final nonnegative Garrote estimator is variable selection consistent. There
are several other works which adopt such two-step procedures, such as the Lars-OLS hybrid
(Efron et al., 2004), the relaxed Lasso (Meinshausen, 2007), the sure independence screening
(Fan and Lv, 2008), the one-step sparse estimator (Zou and Li, 2008), etc. Most of the
two-step procedures are computationally simple and do not require the irrepresentable condition
on the design matrix, and some of them have been shown to have the oracle properties under
certain conditions. However, the success of such two-step procedures depends crucially on
the existence of a good initial estimator, which is not trivial to establish and also requires
conditions on the design matrix, especially for high-dimensional problems. For instance,
Huang et al. (2006) used the univariate regression as the initial estimator in the adaptive
Lasso and showed that a partial orthogonal condition is needed in order for it to satisfy the
required condition in the second step.

In this paper we study several two-step procedures as well as the Ridge estimator as the initial
estimator for high-dimensional problems. In Section II we first study under which conditions
the nonnegative Garrote, adaptive Lasso and hard-thresholding procedures can turn an
$\ell_\infty$-consistent estimator into a final estimator that is variable selection consistent. With some
minor conditions on the penalty parameter $\lambda_n$, we show that both the nonnegative Garrote
and adaptive Lasso estimators also have the oracle properties as defined in Fan and Li (2001).
In Section III we study the conditions under which the Ridge estimator is $\ell_\infty$-consistent. The
condition on the design matrix and true parameter is usually satisfied when $p_n < n$ and does
not require the partial orthogonal condition (Huang et al., 2008) when $p_n > n$. Encouraging
numerical results are provided in Section IV. Those two-step procedures with the Lasso
or Ridge estimator as initial estimator are shown to have a higher success rate in terms
of sparsity recovery than both the Lasso and adaptive Lasso with univariate regression
as initial estimator. Results on prediction error also show that the adaptive Lasso with
the Ridge initial estimator becomes more favorable when there exist stronger correlations
between covariates.



II. Two-step Procedures for Variable Selection 
In the following we assume that an initial estimator can be obtained. For notational
simplicity, we will use $\hat\beta$ to denote the initial estimator, and also define
$A^* = \mathrm{diag}(\beta^*_1, \ldots, \beta^*_{p_n})$ and $A = \mathrm{diag}(\hat\beta_1, \ldots, \hat\beta_{p_n})$
respectively. We study several two-step procedures obtained using
$X$, $Y$ and the initial estimator $\hat\beta$.

We use $\beta^*_S$ to represent the subvector of $\beta^*$ which only contains entries $j \in S$, and it is
obvious that $\beta^*_{S^c} = 0$. Similarly we use $X_S$ and $X_{S^c}$ to denote the sub-matrices of the design
matrix $X$ which only contain columns in $S$ and $S^c$, respectively. Since we are mainly
interested in the situation with $p_n$ increasing, we also define $\rho_n = \min_{j \in S} |\beta^*_j|$, which is
allowed to converge to zero at a relatively slow rate. Throughout the paper, we assume that
$\max_j |\beta^*_j| < \infty$.



Assumption 1 Assume that the initial estimator $\hat\beta$ is an $\ell_\infty$-consistent estimator of $\beta^*$,
and $\|\hat\beta - \beta^*\|_\infty = \max_j |\hat\beta_j - \beta^*_j| = O_p(\delta_n)$ for some sequence
$\delta_n \to 0$ such that $\delta_n = o(\rho_n)$.

Although we assume that the initial estimator is a good approximation to the true parameter
$\beta^*$, we do not assume that $\hat\beta$ can exactly recover the sparsity pattern of $\beta^*$, since that often
requires a stronger condition on the design matrix, as in the case of the Lasso estimator.
It turns out that for two-step procedures to be variable selection consistent, the
$\ell_\infty$-consistency condition is sufficient. Note that similar conditions for the initial estimator
have been used in earlier work (Zou, 2006; Huang et al., 2006; Yuan and Lin, 2007). It should
also be obvious that in order for later procedures to separate variables in $S$ from those in
$S^c$, we need to have $\rho_n$ converging to zero at a slower rate than $\delta_n$.

For any vector $\beta \in \mathbb{R}^{p_n}$, we define its support as $\mathrm{supp}(\beta) = \{j : \beta_j \neq 0\}$. A procedure is
called variable selection consistent if its sequence of solutions $\hat\beta_n$, as a function of sample size
$n$, satisfies

$$\lim_{n \to \infty} P\left(\mathrm{supp}(\hat\beta_n) = \mathrm{supp}(\beta^*)\right) = 1. \tag{2.1}$$

Furthermore, we also consider a slightly stronger property called sign consistency, which is
defined by

$$\lim_{n \to \infty} P\left(\mathrm{sign}(\hat\beta_n) = \mathrm{sign}(\beta^*)\right) = 1 \tag{2.2}$$

where $\mathrm{sign}(t) = -1, 0, 1$ when $t < 0$, $t = 0$ and $t > 0$ respectively. All our results about
variable selection consistency trivially imply sign consistency as long as the initial estimator
is $\ell_\infty$-consistent with rate faster than $\rho_n$.
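For a single fitted vector, the events inside (2.1) and (2.2) can be checked directly. The following is a minimal sketch in Python with NumPy; the function names and the optional tolerance `tol` are illustrative choices, not part of the paper.

```python
import numpy as np

def support(beta, tol=0.0):
    """Set of indices j with |beta_j| > tol (tol is an illustrative numerical tolerance)."""
    beta = np.asarray(beta, dtype=float)
    return set(np.flatnonzero(np.abs(beta) > tol))

def same_support(beta_hat, beta_star, tol=0.0):
    """Event in (2.1): the estimated and the true support coincide."""
    return support(beta_hat, tol) == support(beta_star, tol)

def same_sign(beta_hat, beta_star, tol=0.0):
    """Event in (2.2): the sign vectors coincide entry by entry."""
    def sgn(b):
        b = np.asarray(b, dtype=float)
        return np.sign(np.where(np.abs(b) > tol, b, 0.0))
    return bool(np.array_equal(sgn(beta_hat), sgn(beta_star)))
```

Averaging these indicators over repeated simulations gives an empirical estimate of the probabilities in (2.1) and (2.2).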

A. Nonnegative Garrote 

Let $X$ and $Y$ be the design matrix and response vector, and assume that some initial
estimator $\hat\beta$ for the unknown parameter $\beta^*$ is given. Let $Z = XA$. The nonnegative Garrote
estimator (Breiman, 1995) $\hat\beta^{NG}$ is defined as $\hat\beta^{NG}_j = \hat\beta_j \hat d_j$ for $j = 1, \ldots, p_n$, where
$\hat d = (\hat d_1, \ldots, \hat d_{p_n})^T$ is the minimizer of

$$\frac{1}{2n}\|Y - Zd\|_2^2 + \lambda_n \sum_{j=1}^{p_n} d_j \tag{2.3}$$
$$\text{subject to } d_j \ge 0 \text{ for } j = 1, \ldots, p_n. \tag{2.4}$$
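As an illustration of (2.3)-(2.4), the sketch below solves the constrained problem by cyclic coordinate descent, where each coordinate update has a closed form. It assumes the objective is scaled as $\frac{1}{2n}\|Y - Zd\|_2^2 + \lambda_n \sum_j d_j$; the solver, the function name `nonnegative_garrote`, and the stopping parameters are assumptions made for this example, and the paper does not prescribe a particular algorithm here.

```python
import numpy as np

def nonnegative_garrote(X, Y, beta_init, lam, n_iter=500, tol=1e-10):
    """Coordinate descent for min_d (1/2n)||Y - Z d||^2 + lam * sum(d), d >= 0, Z = X diag(beta_init)."""
    X = np.asarray(X, dtype=float)
    beta_init = np.asarray(beta_init, dtype=float)
    n, p = X.shape
    Z = X * beta_init                      # scales column j of X by beta_init[j]
    col_sq = (Z ** 2).sum(axis=0) / n      # (1/n) * ||Z_j||^2
    d = np.zeros(p)
    resid = np.asarray(Y, dtype=float).copy()   # residual Y - Z d (d starts at zero)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:           # beta_init[j] == 0: the variable stays excluded
                continue
            # (1/n) Z_j^T (partial residual that ignores coordinate j)
            rho = Z[:, j] @ resid / n + col_sq[j] * d[j]
            d_new = max(0.0, (rho - lam) / col_sq[j])   # projection onto d_j >= 0
            if d_new != d[j]:
                resid -= Z[:, j] * (d_new - d[j])
                max_change = max(max_change, abs(d_new - d[j]))
                d[j] = d_new
        if max_change < tol:
            break
    return beta_init * d, d
```

The function returns both the final estimate $\hat\beta_j \hat d_j$ and the shrinkage factors $\hat d$, so that the sparsity pattern can be read off from either.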
Although the initial estimator for the nonnegative Garrote method was originally defined as
the least squares estimator, it does not need to be so. In particular, Yuan and Lin (2007)
considered a more general initial estimator for the nonnegative Garrote method with fixed
$p_n$. Our result here is an extension of Yuan and Lin (2007), as we give a general sufficient
condition for the nonnegative Garrote to be variable selection consistent in terms of the triple
$(n, p_n, s_n)$. We start with a lemma which is a direct consequence of the Karush-Kuhn-Tucker
(KKT) condition in convex optimization.



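Lemma 2.1. For any $\lambda_n > 0$ and $Z = XA = X\,\mathrm{diag}(\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_{p_n})$ where $\hat\beta$ is some
initial estimator of $\beta^*$, assume that $(Z_S^T Z_S)^{-1}$ exists. Then there exists a solution of the
nonnegative Garrote that exactly recovers the sparsity pattern if and only if

$$\left(\frac{1}{n} Z_S^T Z_S\right)^{-1} \left(\frac{1}{n} Z_S^T X_S \beta^*_S + \frac{1}{n} Z_S^T \epsilon - \lambda_n \mathbf{1}\right) > \mathbf{0} \tag{2.5}$$

$$\frac{1}{n} Z_{S^c}^T \left(I - Z_S (Z_S^T Z_S)^{-1} Z_S^T\right) \epsilon + \lambda_n Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1} \mathbf{1} \le \lambda_n \mathbf{1} \tag{2.6}$$

where $\mathbf{0}$ and $\mathbf{1}$ are vectors composed of 0's and 1's respectively, and the inequalities hold
element-wise.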



The assumption that the $s_n \times s_n$ matrix $Z_S^T Z_S$ is invertible is quite reasonable. It implies
two conditions: (1) $(X_S^T X_S)^{-1}$ exists; (2) $\hat\beta_j \neq 0$ for all $j \in S$. The first condition is
usually needed in order to estimate $\beta^*_S$, and the second condition is satisfied as long as the
initial estimator $\hat\beta_S$ is element-wise close to the true parameter $\beta^*_S$ asymptotically.
Furthermore, inequalities (2.5) and (2.6) imply that there is no under-selection and no
over-selection, respectively.
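As a reading aid, the two inequalities can be traced back to the KKT system of the nonnegative Garrote. The sketch below assumes the objective in (2.3) is scaled as $\frac{1}{2n}\|Y - Zd\|_2^2 + \lambda_n \sum_j d_j$ and writes $P_S$ for the projection onto the column space of $Z_S$; the symbol $P_S$ is introduced here for convenience and is not part of the original text.

```latex
% KKT conditions for a candidate solution with d_S > 0 and d_{S^c} = 0
\begin{align*}
\tfrac{1}{n} Z_S^T (Y - Z_S d_S) &= \lambda_n \mathbf{1}
  && \text{(stationarity on } S\text{)} \\
\tfrac{1}{n} Z_{S^c}^T (Y - Z_S d_S) &\le \lambda_n \mathbf{1}
  && \text{(dual feasibility on } S^c\text{)} \\
% solve the first line for d_S and use Y = X_S \beta_S^* + \epsilon
d_S &= \left(\tfrac{1}{n} Z_S^T Z_S\right)^{-1}
       \left(\tfrac{1}{n} Z_S^T X_S \beta_S^* + \tfrac{1}{n} Z_S^T \epsilon
             - \lambda_n \mathbf{1}\right) > \mathbf{0} \\
% substitute d_S; (I - P_S) X_S = 0 because Z_S = X_S A_S spans the same columns as X_S
Y - Z_S d_S &= (I - P_S)\,\epsilon + n \lambda_n Z_S (Z_S^T Z_S)^{-1} \mathbf{1},
  \qquad P_S = Z_S (Z_S^T Z_S)^{-1} Z_S^T
\end{align*}
```

The positivity of $d_S$ in the third line is the no-under-selection condition (2.5). Multiplying the last line by $\frac{1}{n} Z_{S^c}^T$ and inserting it into the dual feasibility condition gives the no-over-selection condition (2.6).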

We will use $\Lambda_{\min}(\cdot)$ to denote the minimum eigenvalue operator, and in particular, we also use
$\Lambda_{\min}$ to denote the lower bound of $\Lambda_{\min}(X_S^T X_S / n)$. The following result gives the conditions
on the sparsity level $s_n$, the total number of predictors $p_n$ and the regularization parameter $\lambda_n$
under which the nonnegative Garrote estimator $\hat\beta^{NG}$ (or $\hat d$ equivalently) can correctly recover
the sparsity pattern as $n \to \infty$. In other words, the nonnegative Garrote procedure is variable
selection consistent when $\hat\beta$ is a good initial estimator and the quantities
$(n, p_n, s_n, \lambda_n, \rho_n, \delta_n)$ satisfy certain conditions.

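Theorem 2.2. (Nonnegative Garrote) Under Assumption 1, further assume that

$$\|X_{S^c}^T X_S (X_S^T X_S)^{-1}\|_\infty \le C_{\max} < +\infty \tag{2.7}$$

$$\Lambda_{\min}\left(\frac{1}{n} X_S^T X_S\right) \ge \Lambda_{\min} > 0. \tag{2.8}$$

Then the nonnegative Garrote estimator $\hat\beta^{NG}$ is variable selection consistent, i.e.

$$\lim_{n \to \infty} P\left(\mathrm{sign}(\hat\beta^{NG}) = \mathrm{sign}(\beta^*)\right) = 1, \tag{2.9}$$

if the following conditions hold:

$$\frac{\lambda_n}{\rho_n^2} \to 0, \qquad [\ldots]\sqrt{\frac{s_n \log s_n}{n}} \to 0, \qquad \frac{\delta_n}{\lambda_n}\sqrt{\frac{\log p_n}{n}} \to 0. \tag{2.10}$$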



First, the irrepresentable condition for the Lasso is
$\|X_{S^c}^T X_S (X_S^T X_S)^{-1} \mathrm{sign}(\beta^*_S)\|_\infty < 1$. A slightly stronger condition that does
not depend on $\beta^*$ is $\|X_{S^c}^T X_S (X_S^T X_S)^{-1}\|_\infty < 1$. Here
we only need to have $\|X_{S^c}^T X_S (X_S^T X_S)^{-1}\|_\infty \le C_{\max} < \infty$ for the nonnegative Garrote if we
have some good initial estimator $\hat\beta$. This is mainly because

$$\|Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1}\|_\infty = \|A_{S^c} X_{S^c}^T X_S (X_S^T X_S)^{-1} A_S^{-1}\|_\infty \tag{2.11}$$
$$\le O_p(\delta_n / \rho_n)\, \|X_{S^c}^T X_S (X_S^T X_S)^{-1}\|_\infty \tag{2.12}$$

and $\delta_n = o(\rho_n)$. Also, the boundedness of $C_{\max}$ and $\Lambda_{\min}$ in equations (2.7) and (2.8) is only
assumed to simplify the results, and more general conditions can be obtained by allowing
them to converge to $\infty$ and $0$ slowly. In practice, one may set the penalty parameter $\lambda_n$
proportional to $\sqrt{\log p_n / n}$. Assuming $\rho_n$ is bounded away from 0, the above conditions
state that $p_n$ can increase almost as fast as $\exp(n)$, which is a well-known condition about
$(p_n, n)$ for the Lasso in high-dimensional variable selection. The stringent condition on the
design matrix has now been replaced by the condition that we have a good estimator $\hat\beta$ such
that $\max_j |\hat\beta_j - \beta^*_j| = O_p(\delta_n)$.
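The identity in (2.11) and the bound in (2.12) can be checked numerically on a small random design. In this sketch $\|\cdot\|_\infty$ is taken to be the matrix infinity norm (maximum absolute row sum), which is an assumption about the paper's notation; the data, the hypothetical initial estimate `beta_init`, and the helper names are illustrative only.

```python
import numpy as np

def cross_term(M, S, Sc):
    """Return M_{S^c}^T M_S (M_S^T M_S)^{-1} for a design-like matrix M."""
    MS, MSc = M[:, S], M[:, Sc]
    return MSc.T @ MS @ np.linalg.inv(MS.T @ MS)

def inf_norm(M):
    """Matrix infinity norm: maximum absolute row sum."""
    return np.abs(M).sum(axis=1).max()

rng = np.random.default_rng(0)
n, p, s = 50, 8, 3
S, Sc = np.arange(s), np.arange(s, p)
X = rng.standard_normal((n, p))
# Hypothetical initial estimate: large on S (order rho_n), small off S (order delta_n).
beta_init = np.concatenate([rng.uniform(1.0, 2.0, s), rng.uniform(-0.05, 0.05, p - s)])

Z = X * beta_init                                   # Z = X A with A = diag(beta_init)
lhs = cross_term(Z, S, Sc)                          # left-hand side of (2.11)
rhs = np.diag(beta_init[Sc]) @ cross_term(X, S, Sc) @ np.diag(1.0 / beta_init[S])
# Bound in the spirit of (2.12): max |beta_init| off S over min |beta_init| on S.
bound = np.abs(beta_init[Sc]).max() / np.abs(beta_init[S]).min() * inf_norm(cross_term(X, S, Sc))
print(np.allclose(lhs, rhs), inf_norm(lhs) <= bound)
```

The rescaling by $A_{S^c}$ on the left and $A_S^{-1}$ on the right is what shrinks the cross term: entries of the initial estimator off the support are of order $\delta_n$, while those on the support are of order at least $\rho_n$.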



Properties of the nonnegative Garrote estimator were studied in (Yuan and Lin, 2007) for
fixed $p_n$. Although it was suspected that the nonnegative Garrote estimator might be efficient
in estimation, it was only shown that $\max_j |\hat\beta^{NG}_j - \beta^*_j| = O_p(\delta_n)$ for a general design matrix,
that is, they only showed that $\hat\beta^{NG}$ is no worse than the initial estimator $\hat\beta$ in terms
of estimation. In the following we show that with some additional conditions, the final
nonnegative Garrote estimator is in fact efficient in estimation, i.e. it has the oracle properties
(Fan and Li, 2001; Fan and Peng, 2004; Huang et al., 2006).



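Theorem 2.3. Let $x_i^T$ be the $i$-th row vector of $X$ (i.e. $x_i$ is the $i$-th observation), and
denote by $x_{i(S)}^T$ the $i$-th row of $X_S$. Under the assumptions in Theorem 2.2 and additionally

$$\lambda_n \sqrt{n s_n} / \rho_n \to 0, \tag{2.13}$$

$$n^{-1/2} \max_{1 \le i \le n} \left(x_{i(S)}^T x_{i(S)}\right)^{1/2} \to 0, \tag{2.14}$$

then,

$$\sqrt{n}\, w_n^{-1} v_n^T (\hat\beta^{NG}_S - \beta^*_S) \to_D N(0, 1), \tag{2.15}$$

where $w_n^2 = \sigma^2 v_n^T \left(\frac{1}{n} X_S^T X_S\right)^{-1} v_n$ for any $s_n \times 1$ vector $v_n$ satisfying $\|v_n\|_2 \le 1$.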



Condition (2.14) is usually satisfied if we normalize covariates and $s_n$ does not increase too
fast. Condition (2.13) says $\lambda_n$ should converge to zero at a rate faster than $n^{-1/2}$ to ensure
efficient estimation. In particular, if we assume $\rho_n$ is bounded away from zero, $s_n = O(1)$,
$p_n = \exp(n^{c_1})$ and $\delta_n = n^{[\ldots]}$, then condition (2.10) in Theorem 2.2 together with condition
(2.13) can be satisfied if we choose $\lambda_n = n^{-c_2}$ with $\frac{1}{2} < c_2 < [\ldots]$.



B. Adaptive Lasso 



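Given some initial estimator $\hat\beta$ and defining $Z = XA$, the adaptive Lasso estimator (Zou,
2006) is defined by

$$\hat\beta^{ALasso} = \arg\min_{\beta} \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n \sum_{j=1}^{p_n} \frac{|\beta_j|}{|\hat\beta_j|^\gamma} \tag{2.16}$$

where $\gamma > 0$ is some tuning parameter. Considering the case $\gamma = 1$, it is easy to see that the
above definition is equivalent to $\hat\beta^{ALasso}_j = \hat\beta_j \hat d_j$ for $j = 1, \ldots, p_n$ with $\hat d$ being the
minimizer of

$$\hat d = \arg\min_{d} \frac{1}{2n}\|Y - Zd\|_2^2 + \lambda_n \sum_{j=1}^{p_n} |d_j|. \tag{2.17}$$

Zou (2006) studied properties of the adaptive Lasso for fixed $p_n$ and showed that it has the
oracle properties.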
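The reparametrization in (2.17) means the adaptive Lasso with $\gamma = 1$ can be computed by any ordinary Lasso solver applied to the rescaled design $Z$. The sketch below uses a plain cyclic coordinate descent with soft-thresholding, assuming the $\frac{1}{2n}$ scaling of the squared loss; the function names and stopping parameters are assumptions made for this example, and the paper's own experiments use the LARS algorithm instead.

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso_cd(Z, Y, lam, n_iter=500, tol=1e-10):
    """Coordinate descent for min_d (1/2n)||Y - Z d||^2 + lam * sum(|d_j|)."""
    Z = np.asarray(Z, dtype=float)
    n, p = Z.shape
    col_sq = (Z ** 2).sum(axis=0) / n
    d = np.zeros(p)
    resid = np.asarray(Y, dtype=float).copy()   # residual Y - Z d (d starts at zero)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:                # zero column: coefficient stays at zero
                continue
            rho = Z[:, j] @ resid / n + col_sq[j] * d[j]
            d_new = soft_threshold(rho, lam) / col_sq[j]
            if d_new != d[j]:
                resid -= Z[:, j] * (d_new - d[j])
                max_change = max(max_change, abs(d_new - d[j]))
                d[j] = d_new
        if max_change < tol:
            break
    return d

def adaptive_lasso(X, Y, beta_init, lam):
    """Adaptive Lasso with gamma = 1: Lasso on Z = X diag(beta_init), then rescale."""
    beta_init = np.asarray(beta_init, dtype=float)
    d = lasso_cd(np.asarray(X, dtype=float) * beta_init, Y, lam)
    return beta_init * d
```

Compared with the nonnegative Garrote, the only change in the coordinate update is that the one-sided projection onto $d_j \ge 0$ is replaced by two-sided soft-thresholding.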

The adaptive Lasso and the nonnegative Garrote, both depending on some initial estimator,
are in fact closely related. It was pointed out in (Zou, 2006; Yuan and Lin, 2007) that the
solution of the nonnegative Garrote coincides with the solution of the adaptive Lasso when the
additional constraints $\beta_j \hat\beta_j \ge 0$ ($j = 1, \ldots, p_n$) are imposed. Consequently, those two
methods behave very similarly when the initial estimator is of high quality. The following
lemma (Wainwright, 2006), similar to Lemma 2.1, follows from the KKT condition of the
adaptive Lasso optimization problem.



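Lemma 2.4. For any $\lambda_n > 0$ and $Z = XA = X\,\mathrm{diag}(\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_{p_n})$ where $\hat\beta$ is some initial
estimator of $\beta^*$, assume that $(Z_S^T Z_S)^{-1}$ exists, and let $d^*_S = A_S^{-1} \beta^*_S$, so that
$X_S \beta^*_S = Z_S d^*_S$. Then there exists a solution of the adaptive Lasso
that exactly recovers the sparsity pattern if and only if

$$\left| d^*_S + \left(\frac{1}{n} Z_S^T Z_S\right)^{-1} \left(\frac{1}{n} Z_S^T \epsilon - \lambda_n \mathrm{sign}(d^*_S)\right) \right| > \mathbf{0} \tag{2.18}$$

$$\left| Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1} \left(\frac{1}{n} Z_S^T \epsilon - \lambda_n \mathrm{sign}(d^*_S)\right) - \frac{1}{n} Z_{S^c}^T \epsilon \right| \le \lambda_n \mathbf{1} \tag{2.19}$$

where $\mathbf{0}$ and $\mathbf{1}$ are vectors composed of 0's and 1's, and the inequalities hold element-wise.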



The following two theorems show that under exactly the same conditions as the nonnegative
Garrote, the adaptive Lasso has the oracle properties. Similar results for the adaptive Lasso
have been obtained in Huang et al. (2006).



Theorem 2.5. (Adaptive Lasso) Under the same conditions as in Theorem 2.2, the adaptive
Lasso estimator is variable selection consistent, i.e.

$$\lim_{n \to \infty} P\left(\mathrm{sign}(\hat\beta^{ALasso}) = \mathrm{sign}(\beta^*)\right) = 1. \tag{2.20}$$

Theorem 2.6. Under the same conditions as in Theorem 2.3, the adaptive Lasso estimator
$\hat\beta^{ALasso}$ satisfies

$$\sqrt{n}\, w_n^{-1} v_n^T (\hat\beta^{ALasso}_S - \beta^*_S) \to_D N(0, 1), \tag{2.21}$$

where $w_n^2 = \sigma^2 v_n^T \left(\frac{1}{n} X_S^T X_S\right)^{-1} v_n$ for any $s_n \times 1$ vector $v_n$ satisfying $\|v_n\|_2 \le 1$.



C. Hard-thresholding 

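The hard-thresholding procedure is extremely simple and efficient. Given some initial
estimator $\hat\beta$ and $\lambda_n > 0$, define the hard-thresholding estimator as

$$\hat\beta^{HT}_j = \begin{cases} \hat\beta_j, & \text{if } |\hat\beta_j| \ge \lambda_n, \\ 0, & \text{if } |\hat\beta_j| < \lambda_n. \end{cases} \tag{2.22}$$

Then we have the following results.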

Theorem 2.7. (Hard-Thresholding) Under Assumption 1, choose $\lambda_n$ such that
$\delta_n = o(\lambda_n)$ and $\lambda_n = o(\rho_n)$. Then the hard-thresholding estimator $\hat\beta^{HT}$ is variable selection
consistent, i.e.

$$\lim_{n \to \infty} P\left(\mathrm{sign}(\hat\beta^{HT}) = \mathrm{sign}(\beta^*)\right) = 1. \tag{2.23}$$

Thus this simple hard-thresholding estimator can achieve variable selection consistency as
well if given some good initial estimator $\hat\beta$. Compared to the previous two methods, it can be
directly obtained without any sophisticated optimization and has no restriction on how fast
the number of variables $p_n$ and the number of relevant variables $s_n$ can grow. On the other
hand, it requires the threshold $\lambda_n$ to converge to zero more slowly than $\delta_n$ to ensure
variable selection consistency, no matter how fast $p_n$ grows. Such an explicit relation is not
needed for either the nonnegative Garrote or the adaptive Lasso, since a smaller growth rate of
$p_n$ can make $\frac{\delta_n}{\lambda_n}\sqrt{\log p_n / n} \to 0$ even if $\delta_n > \lambda_n$. Hence the choice of $\lambda_n$ for the
hard-thresholding procedure is more sensitive, at least from the theoretical perspective.
Furthermore, it is obvious that the convergence rate of the resulting estimator $\hat\beta^{HT}$ remains
the same as that of $\hat\beta$, i.e. we have $\max_j |\hat\beta^{HT}_j - \beta^*_j| = O_p(\delta_n)$. However, it is possible to
apply yet another fitting method
using only the subset of selected variables to obtain a much better rate of convergence.
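A minimal sketch of (2.22) is given below, together with one possible refitting step on the selected variables. The refit by ordinary least squares is an assumption made for illustration; the text only says that another fitting method can be applied to the selected subset. The function names are hypothetical.

```python
import numpy as np

def hard_threshold(beta_init, lam):
    """Keep beta_init[j] when |beta_init[j]| >= lam and set it to zero otherwise, as in (2.22)."""
    beta_init = np.asarray(beta_init, dtype=float)
    return np.where(np.abs(beta_init) >= lam, beta_init, 0.0)

def refit_ols(X, Y, beta_ht):
    """Illustrative second fit: least squares restricted to the selected columns."""
    selected = np.flatnonzero(beta_ht)
    beta = np.zeros(X.shape[1])
    if selected.size:
        coef, *_ = np.linalg.lstsq(X[:, selected], Y, rcond=None)
        beta[selected] = coef
    return beta
```

For example, `hard_threshold([0.05, -1.2, 0.3, -0.01], 0.1)` keeps only the second and third entries; passing the result to `refit_ols` re-estimates those two coefficients without the bias of the initial estimator.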

In practice, we may simply choose the hard-thresholding procedure when we know the initial
estimator is $\ell_\infty$-consistent with a fast convergence rate. Otherwise, the adaptive Lasso or the
nonnegative Garrote might be preferred for the second step estimation and selection. We
found that the latter two approaches are quite similar in terms of both theoretical properties
and finite sample performance, as we will see in Section IV.
III. Initial Estimators



Clearly the success of all previous procedures crucially depends on the existence of a good
initial estimator, in the sense that $\max_j |\hat\beta_j - \beta^*_j| = O_p(\delta_n)$ for some sequence $\delta_n \to 0$. For
$p_n$ fixed we could use the ordinary least squares (OLS) solution as the initial estimator.
For $p_n$ increasing we have several choices. The simplest one is to use univariate regression
(aka marginal regression), which calculates the estimator coordinate by coordinate
separately, i.e. $\hat\beta^{Univ}_j = X_j^T Y$. Huang et al. (2006, 2008) have used univariate regression as an
initial estimator in their papers for the high-dimensional adaptive Lasso, and showed that
under some partial orthogonality condition and other conditions the univariate regression
estimator guarantees the zero-consistency that is closely related to the $\ell_\infty$-consistency. The
partial orthogonality condition, which states that $\frac{1}{n} X_{S^c}^T X_S = O(1/\sqrt{n})$, in fact implies the
irrepresentable condition asymptotically as long as $s_n$ does not grow too fast.



Another choice is to run the Lasso first and use $\hat\beta^{Lasso}$ as the initial estimator. Lounici (2008)
studied the $\ell_\infty$ convergence rate of both the Lasso and the Dantzig selector (Candes and Tao,
2008), which requires the off-diagonal elements of $\frac{1}{n} X^T X$ to be small. Unfortunately, such
a condition is quite strong and in fact implies the irrepresentable condition on the design
matrix. Meinshausen and Yu (2006) showed that the Lasso estimator is $\ell_2$-consistent
under some sparse eigenvalue conditions. Since $\ell_2$-consistency $\|\hat\beta^{Lasso} - \beta^*\|_2 = o_p(1)$ implies
that $\|\hat\beta^{Lasso} - \beta^*\|_\infty = O_p(\delta_n)$ for some $\delta_n \to 0$, we can use the Lasso estimator as our initial
estimator. They also pointed out that the conditions under which the Lasso is $\ell_2$-consistent
are not as strong as the irrepresentable condition, which could fail easily even if $p_n < n$ and
the design matrix is of full rank. Other works which study the $\ell_1$- or $\ell_2$-consistency of the
Lasso include Bickel et al. (2007), van de Geer (2006) and Zhang and Huang (2008), which
require similar sparse eigenvalue conditions on the design matrix.



We now consider another popular regression technique, the Ridge regression (Hoerl and Kennard,
1970a,b), which is more suitable for regression with correlated predictors. The Ridge
estimator $\hat\beta^{Ridge}$ is defined as the minimizer of the following objective (for some $\nu_n > 0$):

$$\hat\beta^{Ridge} = \arg\min_{\beta} \frac{1}{n}\|Y - X\beta\|_2^2 + \nu_n \|\beta\|_2^2. \tag{3.1}$$
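Because the Ridge objective is a strictly convex quadratic, its minimizer is available in closed form. The sketch below assumes the scaling $\frac{1}{n}\|Y - X\beta\|_2^2 + \nu_n\|\beta\|_2^2$ in (3.1), under which setting the gradient to zero gives $(\frac{1}{n}X^TX + \nu_n I)\beta = \frac{1}{n}X^TY$; the function name `ridge` is illustrative.

```python
import numpy as np

def ridge(X, Y, nu):
    """Closed-form minimizer of (1/n)||Y - X b||^2 + nu * ||b||^2 for nu > 0."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, p = X.shape
    gram = X.T @ X / n                    # (1/n) X^T X, possibly singular when p > n
    return np.linalg.solve(gram + nu * np.eye(p), X.T @ Y / n)
```

Since $\nu_n > 0$ makes the system matrix positive definite, the same formula applies when $p_n > n$ and $X^TX$ is singular, which is the regime of interest for the initial estimator.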



Our main result is that with a properly chosen regularization parameter $\nu_n$, the Ridge
estimator $\hat\beta^{Ridge}$ is $\ell_\infty$-consistent and thus satisfies our condition as an initial estimator. The
following key assumption is needed in order to establish the $\ell_\infty$-consistency result.



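Assumption 2 Let $e_1, \ldots, e_q, e_{q+1}, \ldots, e_{p_n}$ be the singular vectors of the symmetric matrix
$\frac{1}{n} X^T X$ corresponding to the singular values $d_1 \ge \ldots \ge d_q > d_{q+1} = \ldots = d_{p_n} = 0$,
where $q$ is the rank of $\frac{1}{n} X^T X$ satisfying $q \le \min(n, p_n)$, and let $\beta^* = \sum_{j=1}^{p_n} \alpha_j e_j$. Assume
that $\|\sum_{j=q+1}^{p_n} \alpha_j e_j\|_\infty = O(\zeta_n)$ with some sequence $\zeta_n \to 0$.

The requirement $\|\sum_{j=q+1}^{p_n} \alpha_j e_j\|_\infty = O(\zeta_n)$ is obviously weaker than
$\sum_{j=q+1}^{p_n} \alpha_j^2 = O(\zeta_n^2)$.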

Assumption 2 essentially says that the majority of the mass of $\beta^*$ belongs to the column space of
$\frac{1}{n} X^T X$ asymptotically, i.e. $\beta^* \approx (\frac{1}{n} X^T X) b$ for some $b \in \mathbb{R}^{p_n}$ as $n \to \infty$. First, notice that
the assumption is automatically satisfied when $n \ge p_n$ and $X^T X$ has full rank. However,
this is not the case for the irrepresentable condition, which still requires that those irrelevant
predictors cannot be represented by the relevant predictors in the true model. When $X^T X$ is
singular (as is always the case when $p_n > n$), let us consider the set
$\Omega = \{\theta : X\beta^* = X\theta\}$. In this case, although
any $\theta \in \Omega$ is equally good in terms of predicting $Y$, there is only one true parameter $\beta^*$
among many choices. For any penalized linear method to recover the true parameter $\beta^*$, its
penalty term has to favor $\beta^*$ over any other $\theta \in \Omega$. The condition in Assumption 2 can be
thought of as a relaxed identifiability condition for the Ridge regression to be $\ell_\infty$-consistent.

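Theorem 3.1. Under Assumption 2, the Ridge estimator $\hat\beta^{Ridge}$ satisfies the condition
$\max_j |\hat\beta^{Ridge}_j - \beta^*_j| = o_p(1)$ as long as

$$\frac{\log p_n}{n \nu_n} \to 0 \quad \text{and} \quad \frac{\nu_n \sqrt{s_n}}{d_q} \to 0. \tag{3.2}$$

Furthermore, letting $\nu_n = [\ldots]$ and if $\zeta_n = O(\nu_n \sqrt{s_n} / d_q)$, we have

$$\max_j |\hat\beta^{Ridge}_j - \beta^*_j| = O_p\left([\ldots]\right). \tag{3.3}$$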

First of all, note that when $d_q$ is bounded away from $0$ and $s_n = O(1)$, the result holds for
$p_n = \exp(n^{1 - c_1})$, $\nu_n = n^{-c_2}$ as long as $c_1 > c_2 > 0$. Such conditions can be easily satisfied
for most high-dimensional linear regression problems. Notice that for the Ridge estimator
to be $\ell_\infty$-consistent, there is no constraint placed on $\rho_n$, as small coefficients do not play
as important a role as in the case of variable selection. When Assumption 2 does not hold, it
is easy to see that the results of Theorem 3.1 still hold for the projection of $\beta^*$,
$\sum_{j=1}^{q} \alpha_j e_j$. The following result shows that, unlike $\ell_\infty$-consistency, the Ridge
estimator is in general not $\ell_2$-consistent with a diverging number of parameters.

Corollary 3.2. The Ridge estimator $\hat\beta^{Ridge}$ is in general not $\ell_2$-consistent even when $\beta^*$ is
sparse and $p_n < n$.

The main reason the Ridge estimator is not $\ell_2$-consistent is that the large number
of parameters cancels out the increasing sample size. The Lasso, under certain assumptions
(Meinshausen and Yu, 2006), does not suffer from such large accumulated variance due to its
sparse solution. Fortunately, the two-step procedures only require the weaker $\ell_\infty$-consistency
to be satisfied.



IV. Numerical Studies 

We conduct numerical experiments to evaluate finite sample properties of those two-step
procedures. We consider the usage of univariate regression, OLS regression, ridge regression
and the Lasso as initial estimators. These initial estimators are then processed by the
nonnegative Garrote, adaptive Lasso or hard-thresholding to obtain the final estimator. In
all experiments we consider the linear model $Y = X\beta^* + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I)$.



A. Irrepresentable Condition and Variable Selection Consistency 

First we examine how strongly the irrepresentable condition affects the success rate of those
approaches. We consider an example used in (Zhao and Yu, 2007) to show the
relationship between the probability of selecting the true sparse model and the irrepresentable
condition number $\eta_\infty$ defined as:

$$\eta_\infty = 1 - \|X_{S^c}^T X_S (X_S^T X_S)^{-1} \mathrm{sign}(\beta^*_S)\|_\infty. \tag{4.1}$$

We use the same setting as in (Zhao and Yu, 2007) by taking $n = 100$, $p = 32$ and $s = 5$,
with the true sparse parameter $\beta^*_S = (7, 4, 2, 1, 1)$. The noise level $\sigma^2$ is set to 0.1 to manifest
the asymptotic properties of the estimators.
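The quantity in (4.1) is straightforward to compute for a given design. The following sketch is one way to do so with NumPy; the function name `eta_infinity` and its argument layout are illustrative, and the norm is the maximum absolute entry of the resulting vector.

```python
import numpy as np

def eta_infinity(X, support_idx, beta_star):
    """Irrepresentable condition number: 1 - ||X_Sc^T X_S (X_S^T X_S)^{-1} sign(beta*_S)||_inf."""
    X = np.asarray(X, dtype=float)
    beta_star = np.asarray(beta_star, dtype=float)
    S = np.asarray(support_idx)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)   # indices of irrelevant variables
    XS, XSc = X[:, S], X[:, Sc]
    v = XSc.T @ XS @ np.linalg.solve(XS.T @ XS, np.sign(beta_star[S]))
    return 1.0 - np.abs(v).max()
```

A positive value indicates that the irrepresentable condition holds for the design, while a negative value measures the degree of violation.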

We first sample a covariance matrix $\Sigma$ from Wishart$(p, I_p)$, and then each sample is generated
from $N(0, \Sigma)$. Such a design matrix $X$ may or may not satisfy the strong irrepresentable
condition (Zhao and Yu, 2007), and the degree of violation can be represented by the quantity
$\eta_\infty$. When $\eta_\infty > 0$ the irrepresentable condition holds, and when $\eta_\infty < 0$ we expect the
Lasso to fail in identifying the sparsity pattern for certain cases. We generate 100 designs,
and compute their corresponding $\eta_\infty$. For each design, 1000 simulations are conducted by
generating the noise vector from $N(0, \sigma^2 I)$. For those two-step procedures we use the Ridge
regression as the initial estimator, for which the tuning parameter $\nu_n$ is automatically chosen
by generalized cross-validation (GCV). The tuning parameter $\lambda_n$ for the second step is
chosen optimally over the solution path to find the correct model if possible. For the Lasso we
also select its optimal tuning parameter $\lambda^*$ by searching over the whole solution path. The
advantage of using such a $\lambda^*$ is that our variable selection results will only depend on the
different methods.
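The design sampling step described above can be sketched as follows. A Wishart$(p, I_p)$ draw is formed as $G^T G$ with $G$ a $p \times p$ standard normal matrix, and rows with covariance $\Sigma = G^T G$ are obtained by multiplying standard normal rows by $G$. The function name and the use of a NumPy `Generator` are choices made for this example.

```python
import numpy as np

def sample_wishart_design(n, p, rng):
    """Draw Sigma ~ Wishart(p, I_p), then n rows of X iid from N(0, Sigma)."""
    G = rng.standard_normal((p, p))
    Sigma = G.T @ G                           # sum of p outer products of N(0, I_p) vectors
    X = rng.standard_normal((n, p)) @ G       # each row w^T G has covariance G^T G = Sigma
    return X, Sigma
```

For the setting of this example one would call `sample_wishart_design(100, 32, np.random.default_rng(seed))` once per design, with `seed` chosen by the user.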

Figure 1 shows the percentage of correctly selected models as a function of $\eta_\infty$, and each
design is shown as a dot in the plot. It is obvious that the variable selection accuracy of the
Lasso depends crucially on the irrepresentable condition, even for fixed $p_n$. On the other
hand, results for those two-step procedures are much more accurate in terms of identifying the
true model. In particular, both the nonnegative Garrote and the adaptive Lasso give almost
perfect sparsity recovery for this example, with the result of the hard-thresholding procedure
slightly worse.

B. High-dimensional Variable Selection Accuracy

The above example illustrates how strongly the irrepresentable condition affects the variable
selection accuracy of the Lasso. Even worse, Zhao and Yu (2007) have shown in simulation
that the irrepresentable condition fails with very large probability for medium $p$ and $s$ when
the design is sampled from a general Wishart distribution.

We further conduct experiment to compare the performance of different variable selection 
methods under a general setting. Similar to the previous example we use GCV to select 
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Percentage of Correctly Selected Model 



Percentage of Correctly Selected Model 



-1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 

Percentage of Correctly Selected Model 




-1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 

Percentage of Correctly Selected Model 




Figure 1: Example A: Percentage of correctly selected model as a function of rjoo for the 
Lasso, NG-Ridge, ALasso-Ridge and HT-Ridge: The tuning parameter i/„ for the Ridge 
initial estimator is chosen by GCV and the tuning parameter is set to the optimal one 
by searching the full solution path. 
$\nu_n$ for the initial estimator when applicable and use the optimal tuning parameter $\lambda^*$ for
the second step as well as for the Lasso by searching the full solution path. We let $\sigma^2 = 0.5$,
n = 50, p = 16, 32, 64, 128, 256, 512 and for each p we set s = p/16, 3p/16, 5p/16, ..., unless it is
greater than n. For each (n, p, s) combination, we sample 100 times the covariance matrix
$\Sigma$ from a Wishart distribution Wishart(p, I) and for each covariance matrix $\Sigma$ we sample
every $\beta^*_j$ ($j \in S$) uniformly from $[-2, -0.5] \cup [0.5, 2]$. For each $\Sigma$ we sample 100 times the
design matrix X from the multivariate normal distribution $N(0, \Sigma)$. So in total there will
be 100 x 100 = 10000 simulations for each method with the same set of (n, p, s). Since we
observe that results for the nonnegative Garrote and the adaptive Lasso are very similar
to each other, we only report those of the adaptive Lasso. Also, we only report results for
which at least one of the compared methods has a success rate greater than 0.01.
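
The sampling scheme described above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the authors' code: the function name `sample_design` is hypothetical, Wishart(p, I) is read as p degrees of freedom with identity scale, the support S is drawn uniformly at random (the text does not say how S is placed), and the noise variance is taken to be $\sigma^2 = 0.5$.

```python
import numpy as np

def sample_design(n, p, s, sigma2, rng):
    """Draw one (X, y, beta, Sigma) replicate of the simulation design.

    Assumptions (not stated explicitly in the text): Wishart(p, I) has p
    degrees of freedom and identity scale, and the support is random.
    """
    # Sigma = G G^T with G a p x p standard normal matrix is Wishart(p, I).
    G = rng.standard_normal((p, p))
    Sigma = G @ G.T

    # Nonzero coefficients: uniform on [-2, -0.5] U [0.5, 2].
    support = rng.choice(p, size=s, replace=False)
    beta = np.zeros(p)
    beta[support] = rng.choice([-1.0, 1.0], size=s) * rng.uniform(0.5, 2.0, size=s)

    # Rows of X are N(0, Sigma): if z ~ N(0, I) then G z has covariance G G^T.
    X = rng.standard_normal((n, p)) @ G.T
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    return X, y, beta, Sigma

rng = np.random.default_rng(0)
X, y, beta, Sigma = sample_design(n=50, p=16, s=3, sigma2=0.5, rng=rng)
```

Repeating the outer draw of $\Sigma$ 100 times and the inner draw of X 100 times per $\Sigma$ would give the 10000 replicates per (n, p, s) mentioned in the text.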

In Table 1, the Lasso, HT-Univ and ALasso-Univ perform the worst among all the methods
even for small p. We believe this is because of the stringent conditions they impose on the design
matrix in order to achieve variable selection consistency. The two-step procedures with the Ridge
initial estimator perform well especially when s ^ p < n, and those with the Lasso initial 
estimator perform significantly better than the others when p > n and $\beta^*$ is sparse.



C. Prediction Accuracy and Variable Selection in High Dimensions 

We compare the following procedures: the Lasso, adaptive Lasso with univariate
regression as initial estimator (ALasso-Univ), adaptive Lasso with Lasso as initial estimator
(ALasso-Lasso), adaptive Lasso with Ridge as initial estimator (ALasso-Ridge),
hard-thresholding with univariate regression as initial estimator (HT-Univ), hard-thresholding
with Lasso as initial estimator (HT-Lasso) and hard-thresholding with Ridge as initial
estimator (HT-Ridge).

To compare their prediction performance, we run 200 replications of each example, and
in each replication we generate a training dataset with 50 observations and a test dataset
with 1000 observations. We use the LARS algorithm (Efron et al., 2004) to compute the Lasso
and adaptive Lasso. The tuning parameter $\lambda_n$ is selected by five-fold cross validation. To
measure estimation accuracy we use the relative prediction error (RPE), defined as
$E[(\hat y - x^T \beta^*)^2]/\sigma^2$ where $\hat y = x^T \hat\beta$, and for variable selection we use the numbers of
True Positives (TP) and False Positives (FP), which are defined as
$\mathrm{TP}(\hat\beta) = \sum_{j \in S} I(\hat\beta_j \neq 0)$ and $\mathrm{FP}(\hat\beta) = \sum_{j \notin S} I(\hat\beta_j \neq 0)$.
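
The three criteria above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the expectation in the RPE is approximated by an average over the rows of a test design matrix, which is an assumption about how the quantity would be estimated in practice.

```python
import numpy as np

def selection_counts(beta_hat, support):
    """Return (TP, FP): selected variables inside and outside the true support."""
    selected = np.flatnonzero(np.asarray(beta_hat) != 0)
    support = set(int(j) for j in support)
    tp = sum(1 for j in selected if int(j) in support)
    fp = len(selected) - tp
    return tp, fp

def relative_prediction_error(beta_hat, beta_star, X_test, sigma):
    """Approximate E[(x^T beta_hat - x^T beta_star)^2] / sigma^2 on test rows."""
    diff = X_test @ (np.asarray(beta_hat) - np.asarray(beta_star))
    return float(np.mean(diff ** 2) / sigma ** 2)

# Tiny worked example with a hand-made estimate.
beta_star = np.array([2.0, 0.0, 0.0, 1.5])
beta_hat = np.array([1.9, 0.1, 0.0, 0.0])
tp, fp = selection_counts(beta_hat, support=[0, 3])
rpe = relative_prediction_error(beta_hat, beta_star, np.eye(4), sigma=1.0)
```

Here one relevant variable is kept, one is missed and one irrelevant variable is selected, so the counts are TP = 1 and FP = 1.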

Example 1 (Auto-correlated covariance matrix). We set p = 200 and $\sigma = 1.5$. The covariate
$x_i$ is sampled from a multivariate normal distribution with mean zero and covariance matrix
$\Sigma_{jk} = \rho^{|j-k|}$ with $\rho$ = 0.5, 0.75 and 0.95. $\beta^*$ is chosen so that there are 15 randomly located
non-zero elements and the remaining elements are zero. Five of the non-zero elements equal 2.5,
the second five equal 1.5, and the last five equal 0.5.
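
As an illustration of Example 1, the following sketch generates one training set. It is written under assumptions: the helper names are hypothetical, and which of the 15 random positions receive 2.5, 1.5 or 0.5 is decided by the random draw itself, since the text only says the locations are random.

```python
import numpy as np

def ar1_covariance(p, rho):
    """Covariance with entries rho^{|j-k|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def make_beta_star(p, rng):
    """15 randomly located nonzeros: five 2.5, five 1.5, five 0.5."""
    beta = np.zeros(p)
    support = rng.choice(p, size=15, replace=False)
    beta[support] = np.repeat([2.5, 1.5, 0.5], 5)
    return beta, np.sort(support)

def simulate_example1(n, p, rho, sigma, rng):
    Sigma = ar1_covariance(p, rho)
    beta, support = make_beta_star(p, rng)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta, support

rng = np.random.default_rng(0)
X, y, beta, support = simulate_example1(n=50, p=200, rho=0.5, sigma=1.5, rng=rng)
```

With $\rho = 0.5$ the correlation between variables ten positions apart is already below 0.001, which is the sense in which most pairs of variables are only weakly correlated.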

The auto-correlation structure of the covariance matrix in Example 1 is also used in the
simulations of Tibshirani (1996) and in other Lasso related papers. This example is not
invariant to the position of the variables, which is why the sparsity pattern of $\beta^*$ is
randomized. Because of the high dimensionality and the decaying structure of the covariance
matrix, most of the variables are only weakly correlated. Our next example has moderate to high
correlations among all the variables.
Table 1: Success rate of model selection with optimally chosen $\lambda^*$ in the second step

                       HT-                                  ALasso-
  p    s   Lasso    Univ     OLS      Ridge    Lasso    Univ     OLS      Ridge    Lasso
 16    1   0.9923   0.9787   0.7874   0.9816   0.9988   1        0.9257   0.9962   0.9998
 16    3   0.4927   0.0827   0.7501   0.906    0.9948   0.4898   0.8596   0.9649   0.9965
 16    5   0.2725   0.024    0.6721   0.8614   0.9795   0.2104   0.7919   0.9455   0.9851
 16    7   0.1382   0        0.6734   0.7846   0.9529   0.0786   0.7859   0.8713   0.9557
 16    9   0.0752   0        0.6392   0.7942   0.8964   0.0602   0.6875   0.8253   0.8896
 16   11   0.1103   0.0005   0.657    0.7829   0.8354   0.051    0.671    0.7785   0.8139
 16   13   0.173    0.006    0.7216   0.8336   0.7953   0.0957   0.7088   0.7896   0.763
 16   15   0.5457   0.0653   0.7798   0.8443   0.6788   0.55     0.6863   0.7489   0.6512
 32    2   0.9024   0.5606   0.6077   0.9714   0.9993   0.9212   0.8297   0.9952   0.9994
 32    6   0.2027   0        0.5374   0.8623   0.9952   0.135    0.7211   0.9536   0.9974
 32   10   0.0036   0        0.4137   0.744    0.9898   0.0021   0.5769   0.88     0.9924
 32   14   0.0007   0        0.437    0.7197   0.9645   0.0005   0.5728   0.8219   0.9713
 32   18   0.0003   0        0.4186   0.6984   0.9132   0        0.5081   0.7645   0.9095
 32   22   0.0009   0        0.4536   0.6969   0.8045   0.0001   0.4829   0.67     0.7657
 32   26   0.014    0        0.4827   0.6689   0.6398   0.0006   0.4346   0.5419   0.5627
 32   30   0.1373   0.0101   0.5329   0.6951   0.5036   0.0662   0.3919   0.1669   0.1084
 64    4   0.5752   0.0792   NA       0.8592   0.9993   0.4906   NA       0.9929   0.9999
 64   12   0        0        NA       0.0662   0.9962   0        NA       0.3656   0.9994
 64   20   0        0        NA       0.0086   0.9182   0        NA       0.0648   0.9257
 64   28   0        0        NA       0        0.3995   0        NA       0.0026   0.4009
 64   36   0        0        NA       0        0.0518   0        NA       0.0058   0.0489
 64   44   0        0        NA       0.0023   0        0        NA       0        0
128    8   0.0194   0        NA       0.0048   1        0.02     NA       0.2633   1
128   24   0        0        NA       0        0.02     0        NA       0        0.02
256   16   0        0        NA       0        0.01     0        NA       0        0.01


Example 2 (Constant-correlated covariance matrix). We use the same model as in Example
1 except that the covariance matrix has constant correlations, $\Sigma_{jk} = r$ for $j \neq k$, with r = 0.3, 0.6 and
0.85.

The next example divides X into two orthogonal blocks $X_A$ and $X_{A^c}$, so that $\Sigma_{A^c A} = 0$.
Notice that when A = S, we have $\frac{1}{n} X_{S^c}^T X_S = O_p(1/\sqrt{n})$, which is the random version of the
partial orthogonality condition under which the univariate estimator is a zero-consistent initial
estimator. We allow $X_A$ to be a superset of $X_S$, i.e. $A \supseteq S$.

Example 3 (Generalized partial-orthogonal covariance matrix). We use the same model as in
Example 1 except that the first 15 elements of $\beta^*$ are nonzero and the covariance matrix
has $\Sigma_{A^c A} = 0$, where A includes the first a columns of X and a is chosen as a = 15, 50
and 85. All the other elements of $\Sigma$ are equal to the constant 0.6.
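
The covariance matrices of Examples 2 and 3 can be written down explicitly. The sketch below is an illustration under one assumption that the text leaves implicit: the diagonal of $\Sigma$ is set to 1, so that 0.6 and r are correlations. The function names are hypothetical.

```python
import numpy as np

def constant_correlation(p, r):
    """Example 2: unit variances (assumed) and correlation r between all pairs."""
    Sigma = np.full((p, p), float(r))
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

def block_covariance(p, a, r=0.6):
    """Example 3: blocks A = {0, ..., a-1} and its complement are uncorrelated.

    Within each block all off-diagonal entries equal r; the diagonal is
    set to 1 (an assumption).
    """
    Sigma = np.zeros((p, p))
    Sigma[:a, :a] = r
    Sigma[a:, a:] = r
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

Sigma2 = constant_correlation(200, 0.3)
Sigma3 = block_covariance(200, a=15)
```

With a = 15 the block A coincides with the support S of $\beta^*$, so the relevant and irrelevant covariates are uncorrelated; with a = 50 or 85 the block A also contains irrelevant covariates that are correlated with the relevant ones.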

Table 2: Comparing the Median RPE for Examples 1, 2 and 3 based on 200 replications†

Example 1       rho = 0.5           rho = 0.75          rho = 0.95
Lasso           4.8605 (0.1783)     4.0082 (0.1858)     2.0365 (0.0639)
ALasso-Univ     5.5771 (0.1911)     4.6060 (0.2774)     2.1481 (0.0658)
ALasso-Lasso    4.5650 (0.3076)     4.2502 (0.1729)     2.5520 (0.0874)
ALasso-Ridge    5.7029 (0.3080)     4.3574 (0.2054)     2.0348 (0.0461)
HT-Univ         17.437 (0.2782)     20.706 (0.3246)     31.185 (0.3568)
HT-Lasso        4.4008 (0.1597)     3.5554 (0.1634)     2.1139 (0.0795)
HT-Ridge        12.122 (0.1614)     7.6580 (0.072)      2.4881 (0.0461)

Example 2       r = 0.3             r = 0.6             r = 0.85
Lasso           3.3186 (0.1287)     2.7097 (0.1270)     1.8638 (0.0443)
ALasso-Univ     2.9293 (0.1677)     2.5825 (0.0872)     1.8932 (0.0599)
ALasso-Lasso    3.5494 (0.1467)     3.1304 (0.1381)     2.2803 (0.0655)
ALasso-Ridge    3.2561 (0.2003)     2.9072 (0.1262)     1.8113 (1.8113)
HT-Univ         66.440 (0.8814)     68.643 (0.6295)     35.778 (0.2719)
HT-Lasso        3.0911 (0.1202)     2.6122 (0.1538)     1.7893 (0.0463)
HT-Ridge        9.4863 (0.0707)     5.5772 (0.0509)     2.2337 (0.0184)

Example 3       a = 15              a = 50              a = 85
Lasso           0.7355 (0.0179)     1.3032 (0.0270)     1.7859 (0.0399)
ALasso-Univ     0.6344 (0.0166)     1.6008 (0.0293)     1.7467 (0.0442)
ALasso-Lasso    1.1769 (0.0203)     1.7082 (0.0422)     2.1347 (0.0684)
ALasso-Ridge    0.7217 (0.0206)     1.3450 (0.0322)     1.7208 (0.0361)
HT-Univ         50.623 (0.4758)     57.708 (0.4922)     59.435 (0.5186)
HT-Lasso        0.7438 (0.0162)     1.5345 (0.0434)     1.9722 (0.0466)
HT-Ridge        3.8136 (0.0564)     6.0480 (0.0684)     7.3449 (0.0511)

† The numbers in parentheses are the corresponding standard errors of RPE calculated
from 200 bootstrapped sample medians.

In Table 2, Example 1, when $\rho$ = 0.5 and 0.75, the Lasso has better RPE than ALasso-Univ
and ALasso-Ridge. ALasso-Lasso and HT-Lasso, which use the Lasso as initial
estimator, are also relatively good. This result is expected since the Lasso is good at dealing
with situations where $s \ll p$. When $\rho$ = 0.95, ALasso-Univ and ALasso-Ridge catch
up, with the latter slightly better than all the other methods. In Example 2, when r = 0.3
and 0.6, ALasso-Univ has better RPE than the other methods. When r = 0.85, ALasso-Ridge
catches up and outperforms the Lasso and the other adaptive procedures. In Example 3, when
a = 15, the partial orthogonality condition for univariate estimation is satisfied and ALasso-Univ
performs the best. As a increases to 50, this condition is violated and ALasso-Univ
deteriorates faster than the Lasso and the other adaptive methods. In these two cases, the Lasso and
Table 3: Median number of selected variables for Examples 1, 2 and 3 based on 200 replications†



Example 1 


P = 
TP 


0.5 

FP 


P = 
TP 


0.75 
FP 


P = 
TP 


0.95 
FP 


Lasso 


ii 


1 o 

io 


ii 


Z4 


y 


2o 


ALasso-Univ 


10 


15 


10 


20 
24




10 


11 


10 


19 


8 


20 
Q


94 
n


1 

1 


1 


n 


1 


HT-T,awn 
1 1

1 1 


1 7 

1 1 


1 1 


93 


8 


1 7 

1 1 


HT-RiHo-p 

11 J- IVlU-gjC 


1 3 

i-O 


79 


1 9 


66 


1 9 

1 


65 


Example 2 


r = 
TP 


0.3 


r = 
TP 


: 0.6 

FP 


r = 
TP 


0.85 
FP 


Lasso 




Zo 


ii 


OO 

zo 


iO 


2 <^ 


ALasso-Univ 


12 


26 


11 


27 


10 


27 




12 


24 


11 


24.5 


Q 


22 


A T , fi cco— R 1 H CTfi 
TT-Ij cLooiJ- 1 VI U-^C 


1 


95 


1 1 


95 


1 

lU 


94 


HT-TIniv 

11 ± 111 V 


n 


1 


n 


1 

1 


n 


1 

1 


11 ± UCLooVj 


1 1 


1 Q 

1 C/ 


1 

1 u 


1 6 

lU 


8 

O 


1 4 

It: 


11 ± IVIU-^C 


1 


53 


1 9 

1 


53 


1 9 

1 


60 5 
Example 3       a = 15        a = 50        a = 85
                TP    FP      TP    FP      TP    FP
Lasso           14     8      13    18      13    22
ALasso-Univ     14     6      14    15      13    19
ALasso-Lasso    14     6      12    11      12    14
ALasso-Ridge    15     6      13    15      12    19
HT-Univ          1     0       1     0       0     1
HT-Lasso        14     2      13    10      12    17
HT-Ridge        15     2      15    36      15    80

† "TP" represents the median number of correctly selected variables, whereas "FP"
represents the median number of incorrectly selected variables.

ALasso-Ridge have similar RPEs. When a increases to 85, ALasso-Ridge outperforms all the
other methods.

Hard-thresholding is a different type of procedure and behaves differently. HT-Univ
has large RPE because of the large bias of the univariate regression. HT-Lasso, however,
performs well in all cases. HT-Ridge falls in the middle.

The variable selection results in Table 3 do not show as dramatic a difference as in the
previous examples, where we chose the optimal $\lambda^*$ by searching the full solution path. One
of the reasons is that we use prediction error as the criterion to select $\lambda_n$ in the second step.
Such a criterion, although it can lead to good prediction accuracy, may not be ideal for the
purpose of variable selection. For example, Leng et al. (2006) showed that the Lasso is not
variable selection consistent in general when prediction accuracy is used as the criterion for
selecting the penalty parameter. The development of an effective data-driven approach for
selecting $\lambda_n$ is an interesting future research topic for variable selection.



D. Real Data 



We study the behavior of the previous methods on one real dataset to examine their predictive
power. In particular, we examine the prediction accuracy of all methods as a function of the
sparsity level, i.e. the number of selected variables in the final model, by changing the tuning
parameter $\lambda_n$. The tuning parameter for the initial estimator is chosen automatically by
GCV for the methods HT-Ridge, HT-Lasso, ALasso-Ridge and ALasso-Lasso.

We consider the Boston Housing data, which contains 506 records about housing values
in suburbs of Boston. Each record has 13 continuous features which might be useful in
describing housing price, and the response variable is the median house price. We use all
13 features as well as second order terms except for one binary feature. This results in
a total of 91 predictors. In our experiments, we randomly split the data into a training
set with 100 records and a test set with 406 records. We perform the random splitting
1000 times and report the average mean squared error as a function of the sparsity level
of the selected model. Results are shown in Table 4. From the results we can see that the
Lasso does not perform well when the sparsity level is small. This is because of the high
bias for the selected variables caused by a relatively large penalty $\lambda_n$. On the other hand,
the two-step procedures (except HT-Univ) do not suffer from this problem and perform
better when the sparsity level is low. As the number of selected variables increases, most
methods perform reasonably well. HT-Univ performs very poorly compared to the
other two-step procedures. This is expected, as the univariate estimator is not good and the
hard-thresholding procedure simply cuts at a particular threshold without any data refitting.
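
The count of 91 predictors mentioned above can be reproduced by a simple feature expansion. The text does not spell out exactly which second order terms are used, so the sketch below is one reading that happens to give 91 columns, not a statement of the authors' construction: the 13 original features, the squares of the 12 non-binary features and the 66 pairwise products among those 12 features. The function name and the position of the binary column are hypothetical, and random numbers stand in for the actual data.

```python
import numpy as np
from itertools import combinations

def expand_second_order(X, binary_col):
    """Append squares and pairwise products of all non-binary columns."""
    n, k = X.shape
    cont = [j for j in range(k) if j != binary_col]
    squares = [X[:, j] ** 2 for j in cont]
    cross = [X[:, i] * X[:, j] for i, j in combinations(cont, 2)]
    return np.column_stack([X] + squares + cross)

rng = np.random.default_rng(0)
X_raw = rng.standard_normal((100, 13))      # placeholder for 100 training records
X_expanded = expand_second_order(X_raw, binary_col=3)
```

Under this reading the expanded design has 13 + 12 + 66 = 91 columns. Another reading, the 13 features plus all 78 pairwise products without squares, gives the same count, so the count alone does not identify the construction.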



V. Concluding Remarks 



This paper studies high-dimensional variable selection problems for linear models. In
particular, we study the properties of several two-step procedures including the nonnegative
Garrote, adaptive Lasso and hard-thresholding given some good initial estimator. Our results
give the conditions on $(n, p_n, s_n, \lambda_n)$ under which both the adaptive Lasso and the nonnegative
Garrote can turn an $\ell_\infty$-consistent initial estimator into a final estimator that has the oracle
properties as introduced by Fan and Li (2001). We then show that the Ridge estimator is
$\ell_\infty$-consistent under some relaxed identifiable condition involving $\beta^*$ and $X^T X$. Such a
condition is usually satisfied when $p_n < n$ and does not require the partial orthogonality condition
needed for the univariate regression. Our simulation results show that, equipped with the
Lasso and Ridge estimators as initial estimators, those two-step procedures have a higher
Table 4: Performance of the methods as a function of sparsity level on the Boston Housing
data

                              70.3756   401.654   34.2651   33.8133   39.78626
  4    67.8498   117.6729     73.1718   261.236   29.8815   30.4250   35.67528
  5    49.7597   181.2533     71.9849   342.206   27.5801   28.1729   32.65329
  6    40.7595   254.4535     68.3086   271.984   26.2826   26.3106   29.61199
  7    39.6022   337.3792     68.3103   310.752   25.1754   25.4851   29.16682
  8    39.4237   426.7497     70.1900   387.369   24.2619   24.5633   29.09486
  9    33.9314   521.7511     71.4839   525.376   23.8592   23.6182   27.54199
 10    31.5190   617.0682     74.9066   401.048   23.5533   23.5084   27.00174


success rate in terms of sparsity recovery than the Lasso and the adaptive Lasso with the
univariate regression. Results for high-dimensional estimation with correlated covariates and
real data are also encouraging. Finally, it should not be difficult to extend our results to
non-normal errors which have a light-tailed distribution.



VI. Appendix 

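Proof of Lemma 2.1.

The nonnegative Garrote is a convex optimization problem with a quadratic loss and $p_n$
linear constraints. By standard results from convex optimization we know $\hat d$ is a solution of
the nonnegative Garrote problem if and only if there exists $\alpha = (\alpha_1, \ldots, \alpha_{p_n})^T \ge 0$ such that

\[ \frac{1}{n} Z^T Z \hat d - \frac{1}{n} Z^T Y + \lambda_n \mathbf{1} - \alpha = 0 \tag{6.1} \]

and $\alpha_j = 0$ if $\hat d_j > 0$.

Since $\hat d$ exactly recovers the sparsity pattern if and only if $\hat d_{S^c} = 0$ and $\hat d_S > 0$, combining
these conditions with the above optimality condition we have that the nonnegative Garrote
solution $\hat d$ exactly recovers the sparsity pattern implies

\begin{align}
\frac{1}{n} Z_S^T Z \hat d - \frac{1}{n} Z_S^T Y + \lambda_n \mathbf{1} &= 0 \tag{6.2} \\
\frac{1}{n} Z_{S^c}^T Z \hat d - \frac{1}{n} Z_{S^c}^T Y + \lambda_n \mathbf{1} &\ge 0. \tag{6.3}
\end{align}

Since $Y = X\beta^* + \epsilon = X_S \beta^*_S + \epsilon$ and $Z \hat d = Z_S \hat d_S$, plugging in we have

\begin{align}
\frac{1}{n} Z_S^T Z_S \hat d_S - \frac{1}{n} Z_S^T X_S \beta^*_S - \frac{1}{n} Z_S^T \epsilon &= -\lambda_n \mathbf{1} \tag{6.4} \\
\frac{1}{n} Z_{S^c}^T Z_S \hat d_S - \frac{1}{n} Z_{S^c}^T X_S \beta^*_S - \frac{1}{n} Z_{S^c}^T \epsilon &\ge -\lambda_n \mathbf{1}. \tag{6.5}
\end{align}

Solving the above equations we have

\[ \hat d_S = \left( \frac{1}{n} Z_S^T Z_S \right)^{-1} \left( \frac{1}{n} Z_S^T X_S \beta^*_S + \frac{1}{n} Z_S^T \epsilon - \lambda_n \mathbf{1} \right) \tag{6.6} \]

and

\[ \frac{1}{n} Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1} Z_S^T \epsilon - \lambda_n Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1} \mathbf{1} - \frac{1}{n} Z_{S^c}^T \epsilon + \lambda_n \mathbf{1} \ge 0. \tag{6.7} \]

Now utilizing the fact that $\hat d_S > 0$ we obtain the claimed result. □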
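
The optimality condition (6.1) is easy to check numerically. The sketch below is an illustration, not the procedure used in the paper: it assumes the nonnegative Garrote objective $\frac{1}{2n}\|Y - Zd\|^2 + \lambda_n \sum_j d_j$ subject to $d \ge 0$ with $Z_j = X_j \hat\beta_j$, which is the form consistent with (6.1), and it solves the problem with a plain coordinate descent written for this example. The function names, the OLS initial estimator and the simulated data are all hypothetical choices.

```python
import numpy as np

def nonnegative_garrote(Z, y, lam, n_sweeps=500):
    """Minimise (1/2n)||y - Z d||^2 + lam * sum(d) subject to d >= 0."""
    n, p = Z.shape
    d = np.zeros(p)
    col_sq = (Z ** 2).sum(axis=0) / n
    r = y.astype(float).copy()              # residual y - Z d (d = 0 initially)
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += Z[:, j] * d[j]             # remove coordinate j from the fit
            d[j] = max(0.0, (Z[:, j] @ r / n - lam) / col_sq[j])
            r -= Z[:, j] * d[j]             # add the updated coordinate back
    return d

def kkt_multiplier(Z, y, d, lam):
    """alpha from (6.1): (1/n) Z^T Z d - (1/n) Z^T y + lam * 1."""
    n = Z.shape[0]
    return Z.T @ (Z @ d) / n - Z.T @ y / n + lam

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta_star = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_star + 0.5 * rng.standard_normal(n)

beta_init = np.linalg.lstsq(X, y, rcond=None)[0]   # initial estimator (OLS here)
Z = X * beta_init                                  # Z_j = X_j * beta_init_j
d_hat = nonnegative_garrote(Z, y, lam=0.05)
alpha = kkt_multiplier(Z, y, d_hat, lam=0.05)
beta_ng = d_hat * beta_init                        # final estimate
```

At the solution `alpha` is nonnegative and vanishes on the coordinates with `d_hat > 0`, which is exactly the pair of conditions used in (6.2) and (6.3).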
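Proof of Theorem 2.2.

We only need to show $\lim_{n \to \infty} P(\mathrm{supp}(\hat\beta^{NG}) = \mathrm{supp}(\beta^*)) = 1$, as we have $\hat d \ge 0$ and
$\mathrm{sign}(\hat\beta_S) = \mathrm{sign}(\beta^*_S)$ as $n \to \infty$ by assumption. Recall that we have the diagonal matrix
$A^* = \mathrm{diag}(\beta^*_1, \ldots, \beta^*_{p_n})$ and correspondingly $\hat A = \mathrm{diag}(\hat\beta_1, \ldots, \hat\beta_{p_n})$. We also use the
notation $A^*_S$ and $\hat A_S$ to denote the diagonal submatrices of $A^*$ and $\hat A$ which contain only the
rows and columns whose indices belong to the set S. First, $\hat A_S$ is invertible with probability
tending to 1 since

\[ P\left( \min_{j \in S} |\hat\beta_j| > 0 \right) \to 1 \tag{6.8} \]

as $\delta_n = o(\rho_n)$. In the following we assume that $\hat A_S$ is invertible.

Define random variables $V_j = X_j^T \epsilon / n$ for $j = 1, \ldots, p_n$ and consider the events $\mathcal{A}$ and $\mathcal{B}$
given by

\begin{align}
\mathcal{A} &= \bigcap_{j \in S^c} \left\{ |V_j| \le A\sigma \sqrt{\log(p_n - s_n)/n} \right\} \tag{6.9} \\
\mathcal{B} &= \bigcap_{j \in S} \left\{ |V_j| \le A\sigma \sqrt{\log s_n / n} \right\} \tag{6.10}
\end{align}

where A is some constant that satisfies $A > \sqrt{2}$. By the normal error assumption we have
$\sqrt{n} V_j \sim N(0, \sigma^2)$, and

\begin{align}
P(\mathcal{A}^c) &\le \sum_{j \in S^c} P\left( \sqrt{n} |V_j| > A\sigma \sqrt{\log(p_n - s_n)} \right) \tag{6.11} \\
&\le (p_n - s_n) P\left( |W| > A \sqrt{\log(p_n - s_n)} \right) \tag{6.12} \\
&\le (p_n - s_n) \sqrt{\frac{2}{\pi}} \, \frac{\exp\left( -A^2 \log(p_n - s_n)/2 \right)}{A \sqrt{\log(p_n - s_n)}} \tag{6.13} \\
&\le \sqrt{\frac{2}{\pi}} \, \frac{1}{A \sqrt{\log(p_n - s_n)}} \tag{6.14}
\end{align}

where W is a standard normal variable, the bound (6.13) is Mill's inequality and (6.14) uses
$A > \sqrt{2}$. Similarly we have

\begin{align}
P(\mathcal{B}^c) &\le s_n P\left( |W| > A \sqrt{\log s_n} \right) \tag{6.15} \\
&\le s_n \sqrt{\frac{2}{\pi}} \, \frac{\exp\left( -A^2 \log s_n / 2 \right)}{A \sqrt{\log s_n}} \tag{6.16} \\
&\le \sqrt{\frac{2}{\pi}} \, \frac{1}{A \sqrt{\log s_n}}. \tag{6.17}
\end{align}

Since by our choices of events $\mathcal{A}$ and $\mathcal{B}$ we have $P(\mathcal{A} \cap \mathcal{B}) \to 1$ as $p_n > s_n \to \infty$, the
following analysis will only focus on the event $\mathcal{A} \cap \mathcal{B}$. In particular, under event $\mathcal{A}$ we
have the bound $\|X_{S^c}^T \epsilon / n\|_\infty \le A\sigma \sqrt{\log(p_n - s_n)/n}$ and under event $\mathcal{B}$ we have the bound
$\|X_S^T \epsilon / n\|_\infty \le A\sigma \sqrt{\log s_n / n}$.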

(1) We first show that the probability of under-selection converges to zero, and it suffices to
show that



n 



n n 



(6.18) 



with probability 1. 
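Since $Z_S = X_S \hat A_S$ we have

\[ \hat d_S = \hat A_S^{-1} \beta^*_S + \left( \hat A_S \frac{1}{n} X_S^T X_S \hat A_S \right)^{-1} \hat A_S \frac{1}{n} X_S^T \epsilon - \lambda_n \left( \hat A_S \frac{1}{n} X_S^T X_S \hat A_S \right)^{-1} \mathbf{1}. \tag{6.19} \]

Obviously the first term converges to 1 with probability 1 at a rate $O_p(\delta_n)$ since $\delta_n = o(\rho_n)$.
For the second term we have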



-1 



< 



n ^ 



(6.20) 



as long as $\rho_n^{-1} \sqrt{s_n \log s_n / n} \to 0$.
For the last term we have 

An (Aq—X^XsAs 
\ Tl 



< 



-^min 



Or, 



pI 



(6.21) 



Combining the three terms together we have $\hat d_S \to 1$ with probability 1 if $\lambda_n = o(\rho_n^2 / \sqrt{s_n})$ and
$\rho_n^{-1} \sqrt{s_n \log s_n / n} \to 0$.

(2) We show that the probability of over-selection converges to 0 as well. First, define

\[ W = \frac{1}{n} Z_{S^c}^T \left( I - Z_S (Z_S^T Z_S)^{-1} Z_S^T \right) \epsilon + \lambda_n Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1} \mathbf{1} \tag{6.22} \]

and there is no over-selection if $\max_{j \in S^c} W_j \le \lambda_n$, which is further implied by the event
$\|W\|_\infty \le \lambda_n$. We have



< 



1 



(6.23) 



CO 



— ZgcZs{Zg Zs) Zgt 



\n\\Zl.Zs{ZlZs)-'l\\oo (6.24) 



n 



+ 0, 



Pn 



(6.25) 



Thus on the event ^fl we have ||oo < A„ as n — >• oo as long as ^^Jlogpn/n — >■ 0. The 
result now follows by combining (1) and (2). □ 

Proof of Theorem 2.3.

This theorem can be verified in a similar way as in the proof of Theorem 2 of Huang et al.
(2008). By Theorem 2.2, $P(\hat d_{S^c} = 0) \to 1$ and $P(\hat d_S > 0) \to 1$; then the KKT condition
implies


-ZgZs 1 ds 



1 



ZgY — — A„l. 



n J n 

Plug in y = XsP*s + e, Zs = XgAs and dg = (A^)"^/^^^, we have 



(6.26) 



n 



n 



(6.27) 



then 



V^vl - Ps) = n~'/'v: (ix^Xs) " Xje - V^X^v^ (^^JX^) " (A,)-4. (6.28 



Since 



< VnXn\\Vn\ 



n 



-1 



-XlXs (As)-^l 



< ^Jns:^XnA.^,^ (^Amin(A5) 

< ^^XnKlnPn^i^ + Opil)), 

then, under condition (2.13), we have

V^Wn'vl - pi) = n-'l^wl^vl {^-XlX^ Xle + o,(l). 



(6.29) 

(6.30) 
(6.31) 



(6.32) 



Next, we verify the conditions for the Lindeberg-Feller central limit theorem. Let



V, = n-y'w-h^(^^X^Xs^ (6.33) 
and $W_i = V_i \epsilon_i$, then it is easy to show that

y^r(j2^A=a'j2y" = ^- (6-34) 

\i=l J i=l 

On the other hand, 

n n 

Y,^[WMW,\ > 5)] = a'J2^^^ b'H\V^e.\ > S)] < max E [e?l(|V^.e.| > S)] , (6.35) 

i=l i=l 

then it is enough to show that 

max E k^l(|V^ei| > S)] 0, (6.36) 

l<i<n 

or equivalently, 

-1 



max = n ^^"^w^^ max 

l<i<n l<i<n 



n 



(6.37) 



Since 

vl{^XlXs^ < i^l{^XlXs^ i^ls){^XlXs^ x.(5)j(6.38) 

< a-^WnA;^]'^ {x\s)^,(s)f''^ , (6.39) 
then under assumption (2.14), (6.37) follows. This finishes the proof. □



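Proof of Lemma 2.4.

By assumption we have $\hat\beta_S \neq 0$, thus $\hat\beta^{ALasso}$ exactly recovers the sparsity pattern if
and only if $\hat d$ does so. By the KKT condition, $\hat d$ is a solution if and only if there exists a
subgradient $\hat z \in \partial \|\hat d\|_1$ such that

\[ \frac{1}{n} Z^T Z \hat d - \frac{1}{n} Z^T Y + \lambda_n \hat z = 0 \tag{6.40} \]

where $\hat z_j = \mathrm{sign}(\hat d_j)$ for $\hat d_j \neq 0$ and $|\hat z_j| \le 1$ otherwise. Then it follows that $\hat d$ (and thus
$\hat\beta^{ALasso}$) exactly recovers the sparsity pattern if and only if $\hat d_{S^c} = 0$, $\hat d_S \neq 0$, $|\hat z_{S^c}| \le 1$ and
$\hat z_S = \mathrm{sign}(d^*_S)$.

Combining these conditions with the above optimality condition we have that the adaptive
Lasso solution recovers the sparsity pattern implies

\begin{align}
\frac{1}{n} Z_S^T Z \hat d - \frac{1}{n} Z_S^T Y + \lambda_n \hat z_S &= 0 \tag{6.41} \\
\frac{1}{n} Z_{S^c}^T Z \hat d - \frac{1}{n} Z_{S^c}^T Y + \lambda_n \hat z_{S^c} &= 0. \tag{6.42}
\end{align}

Since $Y = Z d^* + \epsilon = Z_S d^*_S + \epsilon$ and $Z \hat d = Z_S \hat d_S$, plugging in we have

\begin{align}
\frac{1}{n} Z_S^T Z_S \hat d_S - \frac{1}{n} Z_S^T Z_S d^*_S - \frac{1}{n} Z_S^T \epsilon &= -\lambda_n \mathrm{sign}(d^*_S) \tag{6.43} \\
\frac{1}{n} Z_{S^c}^T Z_S \hat d_S - \frac{1}{n} Z_{S^c}^T Z_S d^*_S - \frac{1}{n} Z_{S^c}^T \epsilon &= -\lambda_n \hat z_{S^c}. \tag{6.44}
\end{align}

Solving the above equations we have

\begin{align}
\hat d_S &= d^*_S + \left( \frac{1}{n} Z_S^T Z_S \right)^{-1} \left( \frac{1}{n} Z_S^T \epsilon - \lambda_n \mathrm{sign}(d^*_S) \right) \tag{6.45} \\
-\lambda_n \hat z_{S^c} &= Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1} \left( \frac{1}{n} Z_S^T \epsilon - \lambda_n \mathrm{sign}(d^*_S) \right) - \frac{1}{n} Z_{S^c}^T \epsilon \tag{6.46}
\end{align}

and the result follows since $|\hat d_S| > 0$ and $|\hat z_{S^c}| \le 1$. □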
Proof of Theorem 2.5.

The proof is similar to that of Theorem 2.2. Without loss of generality, assume that $\hat A$ is
invertible and define events $\mathcal{A}$ and $\mathcal{B}$ as before. We only need to consider the situation when
$\mathcal{A} \cap \mathcal{B}$ is true.

(1) We have $d^*_S \to 1$ since $\delta_n = o(\rho_n)$. As in Theorem 2.2 we have

1 



—ZsZs 
n 



-zh 



< 



-Xje 
n ^ 



(6.47) 



as long as $\rho_n^{-1} \sqrt{s_n \log s_n / n} \to 0$. Also we have



n 



Xn ( —ZgZs 



sign(c?c 



< 



O 



P \ 2 



(6.48) 



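Thus we have $\hat d_S \to 1$ if $\lambda_n = o(\rho_n^2 / \sqrt{s_n})$ and $\rho_n^{-1} \sqrt{s_n \log s_n / n} \to 0$.

(2) Define $W = \frac{1}{n} Z_{S^c}^T \left( I - Z_S (Z_S^T Z_S)^{-1} Z_S^T \right) \epsilon + \lambda_n Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1} \mathrm{sign}(d^*_S)$, which is the same
as the random vector W in the proof of Theorem 2.2 except that $\mathbf{1}$ is replaced by $\mathrm{sign}(d^*_S)$.
Thus we have $\|W\|_\infty < \lambda_n$ if $\delta_n \sqrt{\log p_n / n} \to 0$. □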

Proof of Theorem 2.6.

By Theorem 2.5 we have $P(\hat d_{S^c} = 0) \to 1$ and $P(\hat d_S \neq 0) \to 1$. Then the KKT condition
implies



n J \ / n 

and the rest follows exactly as the proof of Theorem 2.3. □



AsXje - A„sign(c?c 



(6.49) 



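Proof of Theorem 2.7.

For all $j \notin S$, i.e. j such that $\beta^*_j = 0$, we have

\[ P\left( \max_{j \notin S} |\hat\beta_j| > \lambda_n \right) = P\left( \max_{j \notin S} |\hat\beta_j| / \delta_n > \lambda_n / \delta_n \right) \to 0 \]

since $\delta_n = o(\lambda_n)$. By the hard-thresholding rule, we have $P(\hat\beta^{HT}_{S^c} = 0) \to 1$.
For all $j \in S$, i.e. j such that $\beta^*_j \neq 0$, we have

\[ P\left( \inf_{j \in S} |\hat\beta_j| > \lambda_n \right) \ge P\left( \inf_{j \in S} \left( |\beta^*_j| - |\hat\beta_j - \beta^*_j| \right) > \lambda_n \right) \ge P\left( \rho_n - \max_{j \in S} |\hat\beta_j - \beta^*_j| > \lambda_n \right) \]

since $\inf_{j \in S} |\hat\beta_j| \ge \rho_n - \max_{j \in S} |\hat\beta_j - \beta^*_j|$. The right hand side converges to 1 as long as
$\lambda_n = o(\rho_n)$. As a result, we have $P(\hat\beta^{HT}_S = \hat\beta_S) \to 1$. □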
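
The hard-thresholding rule used in this proof keeps a coefficient of the initial estimator unchanged when its magnitude exceeds $\lambda_n$ and sets it to zero otherwise. A minimal sketch follows; the function name and the numbers are hypothetical and chosen only to illustrate the two cases of the proof.

```python
import numpy as np

def hard_threshold(beta_init, lam):
    """Keep beta_init[j] if |beta_init[j]| > lam, otherwise set it to 0."""
    beta_init = np.asarray(beta_init, dtype=float)
    return np.where(np.abs(beta_init) > lam, beta_init, 0.0)

# True coefficients (2, -1.5, 0, 0); the initial estimate is off by at most 0.12.
beta_init = np.array([2.1, -1.55, 0.08, -0.12])
beta_ht = hard_threshold(beta_init, lam=0.5)
```

In this toy case the estimation error (at most 0.12) is below the threshold 0.5, and the threshold is below the smallest true signal 1.5, so the two irrelevant coefficients are zeroed and the two relevant ones are kept as they are. This mirrors the roles of $\delta_n = o(\lambda_n)$ and $\lambda_n = o(\rho_n)$ in the argument above.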

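Proof of Theorem 3.1.

First, notice that $(\hat\beta^{Ridge} - \beta^*)$ is a random vector which follows a multivariate normal
distribution with mean

\[ -\nu_n \left( \frac{1}{n} X^T X + \nu_n I \right)^{-1} \beta^* \tag{6.50} \]

and covariance matrix

\begin{align}
\mathrm{Var}\left( \hat\beta^{Ridge} - \beta^* \right) &= \frac{\sigma^2}{n} \left( \frac{1}{n} X^T X + \nu_n I \right)^{-1} \frac{1}{n} X^T X \left( \frac{1}{n} X^T X + \nu_n I \right)^{-1} \tag{6.51} \\
&= \frac{\sigma^2}{n} \left[ \left( \frac{1}{n} X^T X + \nu_n I \right)^{-1} - \nu_n \left( \frac{1}{n} X^T X + \nu_n I \right)^{-2} \right]. \tag{6.52}
\end{align}

Let m be the mean vector and C be the covariance matrix of $(\hat\beta^{Ridge} - \beta^*)$ respectively, and
define $\bar m = \max_j |m_j|$ and $\bar C = \max_j C_{jj}$ to be the uniform upper bounds of the individual
bias and variance.

Define event $\mathcal{E}$ to be

\[ \mathcal{E} = \bigcap_{j=1}^{p_n} \left\{ |\hat\beta^{Ridge}_j - \beta^*_j| \le \sqrt{2 \bar C \log p_n} + \bar m \right\}, \tag{6.53} \]

then we have

\begin{align}
P(\mathcal{E}^c) &\le \sum_{j=1}^{p_n} P\left( |\hat\beta^{Ridge}_j - \beta^*_j| > \sqrt{2 \bar C \log p_n} + \bar m \right) \tag{6.54} \\
&\le \sum_{j=1}^{p_n} P\left( |\hat\beta^{Ridge}_j - \beta^*_j - m_j| > \sqrt{2 \bar C \log p_n} \right) \tag{6.55} \\
&\le p_n P\left( |Z| > \sqrt{2 \log p_n} \right) \tag{6.56} \\
&\le \sqrt{\frac{2}{\pi}} \, \frac{p_n}{\sqrt{2 \log p_n}} \exp(-\log p_n) \to 0 \tag{6.57}
\end{align}

where $Z \sim N(0, 1)$ is a standard normal random variable. So we only need to consider the
situation on the event $\mathcal{E}$. In other words, we need to bound the quantity $\sqrt{2 \bar C \log p_n} + \bar m$.

We first compute $\bar C$. Define $D_{\max}(C)$ to be the operator which returns the maximum diagonal
element of C, and recall that $\Lambda_{\max}(C)$ is the maximum eigenvalue of matrix C; we have

\begin{align}
\bar C &= \frac{\sigma^2}{n} D_{\max}\left[ \left( \frac{1}{n} X^T X + \nu_n I \right)^{-1} - \nu_n \left( \frac{1}{n} X^T X + \nu_n I \right)^{-2} \right] \tag{6.58} \\
&\le \frac{\sigma^2}{n} D_{\max}\left[ \left( \frac{1}{n} X^T X + \nu_n I \right)^{-1} \right] \tag{6.59} \\
&\le \frac{\sigma^2}{n} \Lambda_{\max}\left[ \left( \frac{1}{n} X^T X + \nu_n I \right)^{-1} \right] \tag{6.60} \\
&\le \frac{\sigma^2}{n \nu_n}. \tag{6.61}
\end{align}

Next we bound $\bar m$. Since we have

\[ \frac{1}{n} X^T X = U D U^T \tag{6.62} \]

where $U \in \mathbb{R}^{p_n \times p_n}$ is an orthogonal matrix and $D = \mathrm{diag}(d_1, d_2, \ldots, d_q, 0, \ldots, 0)$ is a diagonal
matrix with $d_1 \ge d_2 \ge \cdots \ge d_q > 0$. Note that $q \le n$ since D has at most n nonzero
elements, and we also use $D^-$ to represent the pseudo-inverse of D. Let columns of U be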
ei, . . . , ep^. By Assumption [2] we have /3* = {^X'^X)h + Yfj^g_^_^ OjBj for some vector b e 
and||E^:,+i^^.e,IU = 0(e„). 

Thus we have 



m 



< 



-iy„ 



-X^X + vj 

n 



\u^U{D + uJ)-'U^f3*^ 



VnU [D + vjy^ U^-X' Xh 



n 



\unU {D + vjy^ DD-DU^h\ 



< \\vnU {D + ujy^ DD-DU'h 



Pn 



j=q+l 



< A^ax {i^n (D + ujy^ D) \\D-DU^h\\^ + 0(en) 



< o 



Vn + di 



d„ 



|D-i?t/^b|l +o(e„) 



(6.63) 
(6.64) 

(6.65) 

(6.66) 

(6.67) 
(6.68) 

(6.69) 
(6.70) 
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The last inequality comes from the fact that 



\\D-DU' b||2 = \\D-U' f3* ~D-[0,..., 0, ^,+1, . . . , e,,y h < -fO{^) 

since the true parameter (3* is assumed to be sparse with only s„ number of nonzero elements. 
Combining the above steps we get that on the event S, we have 



pRidge _ p* 



< v2Clogp„ + m 



< 



logPn + 0{ + ^r. 

nur,. \ da 



as long as 



and 



^ 

0. Furthermore, if ^„ ^ sufficiently fast, we have 
s;;iogp, ^ 



(6.71) 

(6.72) 
(6.73) 




(6.74) 



by setting z/„ = (^^)V3. □ 
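
The bias and covariance expressions (6.50) to (6.52) and the variance bound (6.61) can be checked numerically. The sketch below is an illustration under assumptions: it takes the Ridge estimator to be $\hat\beta^{Ridge} = (\frac{1}{n} X^T X + \nu_n I)^{-1} \frac{1}{n} X^T Y$, which is the form consistent with the mean in (6.50), and it uses a simulated design with $p_n > n$. The function name and all numerical settings are hypothetical.

```python
import numpy as np

def ridge_bias_and_cov(X, beta_star, nu, sigma):
    """Mean and covariance of (beta_ridge - beta_star) for fixed X."""
    n, p = X.shape
    G = X.T @ X / n
    M_inv = np.linalg.inv(G + nu * np.eye(p))
    bias = -nu * M_inv @ beta_star                               # (6.50)
    cov = sigma ** 2 / n * (M_inv @ G @ M_inv)                   # (6.51)
    cov_alt = sigma ** 2 / n * (M_inv - nu * M_inv @ M_inv)      # (6.52)
    return bias, cov, cov_alt

rng = np.random.default_rng(1)
n, p, nu, sigma = 50, 80, 0.1, 1.0
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:5] = 1.0

bias, cov, cov_alt = ridge_bias_and_cov(X, beta_star, nu, sigma)
C_bar = cov.diagonal().max()                 # largest individual variance
variance_bound = sigma ** 2 / (n * nu)       # right hand side of (6.61)

# With noiseless responses the Ridge estimate minus beta_star equals the bias.
beta_noiseless = np.linalg.solve(X.T @ X + n * nu * np.eye(p), X.T @ (X @ beta_star))
```

The two covariance expressions agree up to rounding and `C_bar` does not exceed `variance_bound`. The bound shrinks as $\nu_n$ grows while the bias grows with $\nu_n$, which is the trade-off that the choice of $\nu_n$ at the end of the proof balances.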



Proof of Corollary \3.S[ 

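We only need to consider the special case of orthogonal design where $\frac{1}{n} X^T X = I_{p_n}$. In order
to have the orthogonal design, we need to have $p_n \le n$. Suppose $p_n = n/2$ and the design
is orthogonal such that $\frac{1}{n} X^T X = I_{p_n}$.

Then we have

\[ \hat\beta^{Ridge} - \beta^* = -\frac{\nu_n}{1 + \nu_n} \beta^* + \frac{1}{1 + \nu_n} \frac{1}{n} X^T \epsilon. \tag{6.75} \]

As a result, in order for $\hat\beta^{Ridge}$ to be an $\ell_2$-consistent estimator of $\beta^*$, both the first and
second term need to vanish. The first term goes to 0 for arbitrary $\beta^*$ only if $\nu_n \to 0$,
and in this case we need $\|X^T \epsilon / n\|_2 = o_p(1)$ to ensure the $\ell_2$-consistency. However, we have
$E(X^T \epsilon / n) = 0$ and $\mathrm{Var}(X^T \epsilon / n) = \frac{\sigma^2}{n} I_{p_n}$, so that $E\|X^T \epsilon / n\|_2^2 = \sigma^2 p_n / n = \sigma^2 / 2$.
Consequently, $\|X^T \epsilon / n\|_2 \neq o_p(1)$ since $p_n = n/2$, and the proof is completed. □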